13 research outputs found

    Model Based Clustering for Mixed Data: clustMD

    A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and prostate cancer patients, on whom mixed data have been recorded.
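    The latent Gaussian mixture idea behind clustMD, fitted by EM, can be sketched for the simplest continuous-only case. This is an illustrative Python toy, not the clustMD package itself; the data and initialisation are invented:

```python
import numpy as np

def em_gmm(x, K=2, iters=50):
    """Fit a K-component 1-D Gaussian mixture by expectation maximisation.

    Toy version of the latent Gaussian mixture underlying clustMD;
    the real model adds mixed data types and parsimonious covariances.
    """
    # Deterministic initialisation: spread the means across the data range
    mu = np.linspace(x.min(), x.max(), K)
    sigma = np.full(K, x.std())
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibility r[i, k] is proportional to pi_k * N(x_i | mu_k, sigma_k)
        logp = (np.log(pi) - np.log(sigma)
                - 0.5 * ((x[:, None] - mu) / sigma) ** 2)
        logp -= logp.max(axis=1, keepdims=True)   # stabilise the exponentials
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of the parameters
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

# Simulated data: two latent clusters
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
pi, mu, sigma = em_gmm(x)
print(np.sort(mu).round(2))
```

    For nominal variables the E-step expectation has no closed form, which is why the abstract notes that a Monte Carlo EM variant is required.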

    Clustering South African households based on their asset status using latent variable models

    The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status. A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. For binary and ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure; this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD). The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight into the different socio-economic strata within the Agincourt region. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics; doi:10.1214/14-AOAS726.
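    The shared latent-variable structure the abstract exploits can be illustrated by simulating one ordinal and one binary item from a single continuous factor. The loading and cut-points below are invented for illustration, not estimated from the Agincourt data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
factor = rng.normal(size=n)              # shared latent factor (e.g. household SES)
loading, noise_sd = 1.5, 1.0
z = loading * factor + rng.normal(scale=noise_sd, size=n)  # latent item score

# An ordinal item arises by cutting the latent score at thresholds
cuts = np.array([-1.0, 0.0, 1.0])        # assumed cut-points for a 4-level item
ordinal = np.searchsorted(cuts, z)       # categories 0..3

# A binary item is the special case with a single threshold at zero
binary = (z > 0).astype(int)
print(np.bincount(ordinal))
```

    Because each item type reduces to a thresholded (probit-style) view of the same latent Gaussian, the binary, ordinal and nominal models can share one factor-analytic core, which is what makes the hybrid MFA-MD model possible.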

    Clustering high‐dimensional mixed data to uncover sub‐phenotypes: joint analysis of phenotypic and genotypic data

    The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high-dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes both phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Seven years after the LIPGENE-SU.VI.MAX data were collected, participants were reassessed to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes correspond strongly to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to quantify the uncertainty in sub-phenotype membership at the participant level aligns with the concepts of precision medicine and nutrition. Funding: Science Foundation Ireland; Insight Research Centre; European Commission FP
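    The participant-level membership uncertainty mentioned at the end follows from Bayes' rule in any fitted mixture. A minimal sketch for a two-component, one-dimensional Gaussian mixture; the weights, means and variances below are invented, not values estimated from LIPGENE-SU.VI.MAX:

```python
import numpy as np

def membership_prob(x, pi=(0.6, 0.4), mu=(0.0, 3.0), sd=(1.0, 1.0)):
    """Posterior P(cluster k | x) for a two-component 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    dens = np.stack([
        pi[k] / (sd[k] * np.sqrt(2 * np.pi))
        * np.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2)
        for k in range(2)
    ], axis=-1)
    return dens / dens.sum(axis=-1, keepdims=True)  # normalise per observation

# A participant near a component mean gets a confident assignment;
# one between the components gets a genuinely uncertain one.
probs = membership_prob([0.0, 1.5, 3.0])
print(probs.round(3))
```

    Reporting these probabilities rather than hard labels is what lets the approach express how confidently each participant belongs to the 'healthy' or 'at risk' sub-phenotype.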

    Prediction of tool-wear in turning of medical grade cobalt chromium molybdenum alloy (ASTM F75) using non-parametric Bayesian models

    We present a novel approach to estimating the effect of control parameters on tool wear rates and related changes in the three force components in turning of medical grade Co-Cr-Mo (ASTM F75) alloy. Co-Cr-Mo is known to be a difficult-to-cut material which, due to a combination of mechanical and physical properties, is used for the critical structural components of implantable medical prosthetics. We run a designed experiment that enables us to estimate tool wear from feed rate and cutting speed, and model the relationship using a Bayesian hierarchical Gaussian process model, which enables prediction of tool wear rates for untried experimental settings. The predicted tool wear rates are non-linear and, using our models, we can identify experimental settings that optimise the life of the tool. This approach has potential for future real-time application of data analytics to machining processes. Funding: Enterprise Ireland; DePuy Synthes.
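    The core mechanism, predicting a response at untried settings with a Gaussian process, can be sketched as below. This is plain GP regression on a toy one-dimensional function with hand-picked hyperparameters, not the paper's hierarchical Bayesian model:

```python
import numpy as np

def sq_exp_kernel(A, B, length=1.0, var=1.0):
    """Squared-exponential covariance between two sets of inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X, y, Xstar, noise=1e-4, length=1.0, var=1.0):
    """Posterior mean and variance of a zero-mean GP at new inputs."""
    K = sq_exp_kernel(X, X, length, var) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(Xstar, X, length, var)
    Kss = sq_exp_kernel(Xstar, Xstar, length, var)
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Toy stand-in for a wear curve observed at 8 designed settings
X = np.linspace(0.0, 5.0, 8)[:, None]
y = np.sin(X).ravel()
mean, var = gp_predict(X, y, np.array([[2.5]]))  # predict at an untried setting
print(mean[0], var[0])
```

    The posterior variance is what makes the approach useful for designed experiments: it quantifies how much is still unknown at settings that were never tried.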

    A Protocol for Improved Precision and Increased Confidence in Nanoparticle Tracking Analysis Concentration Measurements between 50 and 120 nm in Biological Fluids

    Nanoparticle tracking analysis (NTA) can be used to quantitate extracellular vesicles (EVs) in biological samples and is widely considered a useful diagnostic tool to detect disease. However, accurately profiling EVs can be challenging due to their small size and heterogeneity. Here, we aimed to provide a protocol to facilitate high-precision particle quantitation by NTA in plasma, in the supernatant of activated purified platelets [the platelet releasate (PR)] and in serum, to increase confidence in NTA particle enumeration. The overall variance and the precision of NTA measurements were quantified by root mean square error and relative standard error. Using a bootstrapping approach, we found that increasing the number of video replicates from 5 × 60 s to 25 × 60 s captures led to a reduction in overall variance and a reproducible increase in the precision of NTA particle-concentration quantitation for all three biofluids. We then validated our approach in an extended cohort of 32 healthy donors. Our results indicate that for vesicles sized between 50 and 120 nm, the precision of routine NTA measurements in serum, plasma, and PR can be significantly improved by increasing the number of video replicates captured. Our protocol provides a common platform to statistically compare particle size distribution profiles in the exosomal-vesicle size range across a variety of biofluids and in both healthy donor and patient groups. Funding: Health Research Board; Irish Research Council; Science Foundation Ireland; UCD School of Medicine.
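    The bootstrap logic of the protocol (resample the available video captures to estimate the relative standard error of the mean concentration) can be sketched as follows. The concentration values are made up for illustration, with the 25-capture run built to have the same spread as the 5-capture run so only the number of captures differs:

```python
import numpy as np

def bootstrap_rse(replicates, n_boot=2000, seed=0):
    """Relative standard error of the mean, by resampling with replacement."""
    rng = np.random.default_rng(seed)
    reps = np.asarray(replicates, dtype=float)
    idx = rng.integers(0, len(reps), size=(n_boot, len(reps)))
    boot_means = reps[idx].mean(axis=1)
    return boot_means.std(ddof=1) / reps.mean()

# Illustrative particle concentrations (particles/mL), one value per 60 s capture
five = np.array([0.90, 1.10, 1.00, 0.95, 1.05]) * 1e9
twenty_five = np.tile(five, 5)

rse_5 = bootstrap_rse(five)
rse_25 = bootstrap_rse(twenty_five)
print(rse_5, rse_25)   # precision of the mean improves with more captures
```

    With equal spread per capture, the standard error of the mean shrinks roughly with the square root of the number of captures, which is the effect the protocol exploits.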